Data Source : https://data.bts.gov/Research-and-Statistics/Freight-Transportation-Service-Index/n68x-u7m7

Description:

The aim of this project is to develop a forecasting model for the Transportation Services Index (TSI) based on historical data. The TSI, produced by the Bureau of Transportation Statistics (BTS), serves as a measure of the volume of freight and passenger transportation services moved monthly by the for-hire transportation sector in the United States. This index is composed of three components: a freight index, a passenger index, and a combined index, each incorporating data from various for-hire transportation modes.

The primary focus of this project is to forecast the TSI values exclusively, disregarding the other components of the index. Changes in the TSI reflect fluctuations, excluding external shocks, in the demand for transportation services, which are indicative of broader economic trends. For instance, during periods of economic expansion there is typically an increase in the demand for goods and services, leading to a corresponding rise in the TSI.

To achieve this objective, we will utilize time series analysis techniques to model the historical TSI data and generate forecasts of future TSI values. Time series analysis allows us to identify patterns, trends, and seasonal variations in the data, which are crucial for developing accurate forecasting models. We will explore several time series forecasting methods, such as autoregressive integrated moving average (ARIMA) models and exponential smoothing.
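
As a preview of that workflow, the sketch below fits an ARIMA model to a simulated monthly series using base R. The ARIMA(1,1,1) order and the simulated data are illustrative assumptions only, not results for the TSI.

```r
# Illustrative only: fit an ARIMA(1,1,1) to a simulated monthly series
# and forecast 12 steps ahead. The order is an assumption for this sketch,
# not a model selected for the TSI data.
set.seed(1)
x <- ts(100 + cumsum(rnorm(120)), start = c(2000, 1), frequency = 12)
fit <- arima(x, order = c(1, 1, 1))   # base-R ARIMA fit
fc  <- predict(fit, n.ahead = 12)     # 12-month-ahead point forecasts
length(fc$pred)                       # 12 forecasts, one per month
```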

The forecasted TSI values will provide valuable insights for stakeholders in the transportation industry, policymakers, and economic analysts. By anticipating changes in the demand for transportation services, informed decisions can be made regarding infrastructure investments, capacity planning, and economic policy formulation. Additionally, understanding future trends in transportation demand can help businesses optimize their supply chain management strategies and logistics operations.

Overall, this project aims to contribute to a better understanding of the dynamics of transportation services demand and provide actionable forecasts to support decision-making processes in both the public and private sectors.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
transportation <- read_csv("Transportation_Services_Index_and_Seasonally.csv")
## Rows: 289 Columns: 66
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): ID, OBS_DATE
## dbl (64): ASM, ASM_D11, ASM_D, ASM_D_D11, ASM_I, ASM_I_D11, RPM, RPM_D11, RP...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transportation
## # A tibble: 289 × 66
##    ID         OBS_DATE      ASM ASM_D11  ASM_D ASM_D_D11  ASM_I ASM_I_D11    RPM
##    <chr>      <chr>       <dbl>   <dbl>  <dbl>     <dbl>  <dbl>     <dbl>  <dbl>
##  1 SATD200001 01/01/2000 7.61e7  7.79e7 5.63e7  57569024 1.99e7  20363265 4.83e7
##  2 SATD200002 02/01/2000 7.30e7  7.86e7 5.43e7  57825860 1.87e7  20735310 4.86e7
##  3 SATD200003 03/01/2000 8.03e7  7.94e7 5.99e7  58447159 2.04e7  20919253 5.96e7
##  4 SATD200004 04/01/2000 7.81e7  7.94e7 5.74e7  58407037 2.07e7  20962932 5.76e7
##  5 SATD200005 05/01/2000 8.05e7  7.91e7 5.89e7  58182279 2.16e7  20908676 5.97e7
##  6 SATD200006 06/01/2000 8.04e7  7.89e7 5.83e7  57589554 2.21e7  21294849 6.40e7
##  7 SATD200007 07/01/2000 8.39e7  7.91e7 6.06e7  57759243 2.32e7  21307529 6.64e7
##  8 SATD200008 08/01/2000 8.47e7  7.93e7 6.15e7  57934341 2.32e7  21386027 6.53e7
##  9 SATD200009 09/01/2000 7.99e7  8.15e7 5.79e7  59928453 2.20e7  21587317 5.55e7
## 10 SATD200010 10/01/2000 8.26e7  8.12e7 6.08e7  59426878 2.19e7  21731495 5.80e7
## # ℹ 279 more rows
## # ℹ 57 more variables: RPM_D11 <dbl>, RPM_D <dbl>, RPM_D_D11 <dbl>,
## #   RPM_I <dbl>, RPM_I_D11 <dbl>, LOAD_FACTOR <dbl>, LOAD_FACTOR_D11 <dbl>,
## #   LOAD_FACTOR_D <dbl>, LOAD_FACTOR_D_D11 <dbl>, LOAD_FACTOR_I <dbl>,
## #   LOAD_FACTOR_I_D11 <dbl>, ENPLANE_I <dbl>, ENPLANE_I_D11 <dbl>,
## #   ENPLANE_D <dbl>, ENPLANE_D_D11 <dbl>, ENPLANE <dbl>, ENPLANE_D11 <dbl>,
## #   AIR_RPM_TSI <dbl>, AIR_RPM_TSI_D11 <dbl>, AIR_RTMFM <dbl>, …

From the above data set we are interested in performing the analysis on the TSI_Total variable.

Below are the time series, ACF, PACF, and autocovariance plots of the TSI_Total variable.

ts_transportation <- ts(transportation$TSI_Total, start = c(2000, 1), frequency = 12, end = c(2023, 12))

plot(ts_transportation,main= " ",
xlab= "Time (months)", ylab="TSI_Total",type="o")

par(mfrow = c(2, 2))
ts.plot(ts_transportation)
acf(ts_transportation)
pacf(ts_transportation)
acf(ts_transportation,type="covariance") 

par(mfrow = c(2, 2))
plot(diff(ts_transportation))
acf(diff( ts_transportation))
pacf(diff(ts_transportation))
acf(diff(ts_transportation),type="covariance") 

From the time series plot above we can see that the data is not stationary: there is a gradual upward trend, with a sharp drop during the COVID-19 period. The ACF and PACF plots also do not look stationary, as many lags are significant. After differencing, the data appears stationary except around the COVID-19 period; the ACF of the differenced series looks reasonable, and the PACF indicates a significant spike at lag 1.

A logarithm transformation was applied in an attempt to normalize the data; however, it does not bring the autocorrelations close to zero, as can be observed in the ACF and PACF plots of the log-transformed series.

par(mfrow = c(2, 2))
ts.plot(log(ts_transportation))
acf(log(ts_transportation))
pacf(log(ts_transportation))
acf(log(ts_transportation),type="covariance") 

As our analysis focuses on the series without external shocks, we filtered the data set to the pre-pandemic period, excluding the later observations affected by the external shock of COVID-19.

Following are plots to identify and analyze the correlation structure of the filtered series: the ACF plot has many significant lags, whereas in the PACF plot only the first lag is significant.

ts_transportation_filtered <- ts_transportation[1:204]
par(mfrow = c(2, 2))
ts.plot(ts_transportation_filtered)
acf(ts_transportation_filtered)
pacf(ts_transportation_filtered)
acf(ts_transportation_filtered,type="covariance") 

Following are three-month and six-month moving averages of the filtered series, used to examine the effect of smoothing on the data.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# Calculate 3-month moving average
moving_avg <- rollmean(ts_transportation_filtered, k = 3, align = "center", fill = NA)

# Plot original time series and moving average
plot(ts_transportation_filtered, main="Original Time Series vs. 3-Month Moving Average", ylab="Transportation Index")
lines(moving_avg, col="red")
legend("topright", legend=c("Original", "Moving Average"), col=c("black", "red"), lty=1)

# Calculate 6-month moving average
moving_avg <- rollmean(ts_transportation_filtered, k = 6, align = "center", fill = NA)

# Plot original time series and moving average
plot(ts_transportation_filtered, main="Original Time Series vs. 6-Month Moving Average", ylab="Transportation Index")
lines(moving_avg, col="red")
legend("topright", legend=c("Original", "Moving Average"), col=c("black", "red"), lty=1)

Following are the time series, ACF, PACF, and autocovariance plots of the differenced filtered series. The data looks stationary: in both the ACF and PACF plots, only the first lag is significant.

par(mfrow = c(2, 2))
ts.plot(diff(ts_transportation_filtered))
acf(diff(ts_transportation_filtered))
pacf(diff(ts_transportation_filtered))
acf(diff(ts_transportation_filtered),type="covariance") 

adf.test((ts_transportation_filtered))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  (ts_transportation_filtered)
## Dickey-Fuller = -2.1091, Lag order = 5, p-value = 0.5304
## alternative hypothesis: stationary
adf.test(diff(ts_transportation_filtered))
## Warning in adf.test(diff(ts_transportation_filtered)): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(ts_transportation_filtered)
## Dickey-Fuller = -4.6921, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

The augmented Dickey-Fuller (ADF) test confirms this. For the undifferenced data the result is insignificant (p-value = 0.53, above 0.05), so we fail to reject the null hypothesis of non-stationarity. For the differenced data the result is significant (p-value = 0.01), so we reject the null hypothesis of non-stationarity and conclude that the differenced series is stationary.
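
To make the direction of the hypotheses concrete, here is a small illustration on simulated data (not the TSI series): a random walk contains a unit root, while its first difference is white noise.

```r
library(tseries)

set.seed(42)
rw <- cumsum(rnorm(200))        # random walk: contains a unit root
# The null hypothesis of adf.test is non-stationarity (a unit root);
# the alternative hypothesis is stationarity.
adf.test(rw)$p.value            # large p-value: fail to reject the unit root
adf.test(diff(rw))$p.value      # small p-value: the difference is stationary
```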

We split the data into training and test sets.

train_transportation <- ts_transportation_filtered[1:180]   # first 180 observations for training
test_transportation  <- ts_transportation_filtered[181:204] # remaining 24 observations for testing

time_train_transportation  <- time(train_transportation)
time_test_transportation   <- time(test_transportation)
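
Note that subsetting a `ts` object with `[` returns a plain numeric vector and drops the time attributes. An equivalent split that preserves them could use `window()`; the sketch below uses a simulated stand-in series of the same length as an assumption.

```r
# Sketch: an equivalent train/test split that preserves ts attributes.
# window() keeps the start date and frequency, unlike "[" indexing.
x <- ts(rnorm(204), start = c(2000, 1), frequency = 12)  # stand-in series
train <- window(x, end = c(2014, 12))    # Jan 2000 - Dec 2014: 180 obs
test  <- window(x, start = c(2015, 1))   # Jan 2015 onward: 24 obs
c(length(train), length(test))
```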

We fitted linear models to the data, with the time series as the response variable and the time index as the predictor variable.

mlr.lin = lm( train_transportation ~ time_train_transportation)
summary(mlr.lin)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8749  -2.2539  -0.0125   3.4972   6.8250 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               99.257784   0.606496  163.66   <2e-16 ***
## time_train_transportation  0.101921   0.005812   17.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.052 on 178 degrees of freedom
## Multiple R-squared:  0.6334, Adjusted R-squared:  0.6313 
## F-statistic: 307.5 on 1 and 178 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.lin, main="",which = 1:4)

The linear regression model indicates a strong positive relationship between time and the training series, with each unit increase in time associated with a 0.102 unit increase in the TSI, holding other variables constant. The model explains approximately 63.34% of the variability, with both the intercept and the time coefficient being statistically significant predictors (p < 0.001). However, the residual plots do not look good, as there are large spikes away from the center in the residuals.

time_sqft_train_transportation = time_train_transportation^2/factorial(2)
mlr.quad=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation ) 
summary(mlr.quad)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     time_sqft_train_transportation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0657  -2.1983   0.1021   3.4398   6.6659 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    98.7801811  0.9174515 107.668  < 2e-16 ***
## time_train_transportation       0.1176665  0.0234037   5.028 1.21e-06 ***
## time_sqft_train_transportation -0.0001740  0.0002505  -0.695    0.488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.057 on 177 degrees of freedom
## Multiple R-squared:  0.6344, Adjusted R-squared:  0.6303 
## F-statistic: 153.6 on 2 and 177 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.quad, main="",which = 1:4)

The quadratic regression model shows that for each unit increase in time, the TSI increases by approximately 0.118 units, holding other variables constant. However, the coefficient for time_sqft_train_transportation is not statistically significant (p = 0.488), suggesting the quadratic term does not significantly improve the model. The overall model, including both predictors, explains approximately 63.03% of the variability (adjusted R-squared), with a statistically significant F-statistic (p < 2.2e-16). However, the residual plots do not look good: there are large spikes away from the center, and the residuals are not normal.

time_cubic_train_transportation= time_train_transportation^3/factorial(3)
mlr.cubic=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation ) 
summary(mlr.cubic)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     time_sqft_train_transportation + time_cubic_train_transportation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3332  -2.3110   0.5942   2.8546   7.9898 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9.345e+01  1.084e+00  86.223  < 2e-16 ***
## time_train_transportation        4.663e-01  5.171e-02   9.017 3.22e-16 ***
## time_sqft_train_transportation  -9.778e-03  1.326e-03  -7.375 6.19e-12 ***
## time_cubic_train_transportation  1.061e-04  1.445e-05   7.346 7.34e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.56 on 176 degrees of freedom
## Multiple R-squared:  0.7202, Adjusted R-squared:  0.7154 
## F-statistic:   151 on 3 and 176 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.cubic, main="",which = 1:4)

The cubic regression model with time_train_transportation, time_sqft_train_transportation, and time_cubic_train_transportation as predictors indicates that, for each unit increase in time, the TSI increases by approximately 0.466 units, holding other variables constant. In addition, time_sqft_train_transportation has a statistically significant negative coefficient, and time_cubic_train_transportation a statistically significant positive one. The overall model explains approximately 71.54% of the variability (adjusted R-squared), with a significant F-statistic (p < 2.2e-16). However, the residual plots do not look good: there are large spikes away from the center, and the residuals are not normal.

time_four_train_transportation= time_train_transportation^4/factorial(4)
mlr.four=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation+time_four_train_transportation ) 
summary(mlr.four)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     time_sqft_train_transportation + time_cubic_train_transportation + 
##     time_four_train_transportation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.085  -1.595   1.072   2.263   5.097 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9.849e+01  1.228e+00  80.230  < 2e-16 ***
## time_train_transportation       -7.737e-02  9.343e-02  -0.828    0.409    
## time_sqft_train_transportation   1.708e-02  4.182e-03   4.085 6.70e-05 ***
## time_cubic_train_transportation -5.851e-04  1.040e-04  -5.626 7.19e-08 ***
## time_four_train_transportation   7.638e-06  1.140e-06   6.698 2.77e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.185 on 175 degrees of freedom
## Multiple R-squared:  0.7773, Adjusted R-squared:  0.7722 
## F-statistic: 152.7 on 4 and 175 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.four, main="",which = 1:4)

The fourth-degree regression model indicates that the quadratic, cubic, and quartic terms are statistically significant, while the linear term is not. Specifically, holding other variables constant, each unit increase in time_sqft_train_transportation is associated with an increase of approximately 0.01708 units, each unit increase in time_cubic_train_transportation with a decrease of approximately 0.0005851 units, and the quartic term contributes a small positive effect (7.638e-06). However, the residual plots do not look good: there are large spikes away from the center, and the residuals are not normal.

Following are the plots of the fitted values from the linear, quadratic, and cubic polynomial models against the training series.

par(mfrow=c(2,2))
ts.plot(train_transportation) # Time Series Plot
# Plot of xfit vs mlr.lin$fitted
plin=cbind(train_transportation,mlr.lin$fitted)
ts.plot(plin,main="xfit and fit.linear")
pquad=cbind(train_transportation,mlr.quad$fitted)
ts.plot(pquad,main="xfit and fit.quadratic")
pcub=cbind(train_transportation,mlr.cubic$fitted)
ts.plot(pcub,main="xfit and fit.cubic")

# Create a dataframe to store the model summary information
model_summary <- data.frame(
  Model = c("Linear", "Quadratic", "Cubic", "4th Degree"),
  Multiple_R_squared = c(0.6334, 0.6344, 0.7202, 0.7773),
  Adjusted_R_squared = c(0.6313, 0.6303, 0.7154, 0.7722),
  F_statistic = c(307.5, 153.6, 151, 152.7),
  Residual_DF = c(178, 177, 176, 175),
  Residual_standard_error = c(4.052, 4.057, 3.56, 3.185)
)

# Print the dataframe
print(model_summary)
##        Model Multiple_R_squared Adjusted_R_squared F_statistic Residual_DF
## 1     Linear             0.6334             0.6313       307.5         178
## 2  Quadratic             0.6344             0.6303       153.6         177
## 3      Cubic             0.7202             0.7154       151.0         176
## 4 4th Degree             0.7773             0.7722       152.7         175
##   Residual_standard_error
## 1                   4.052
## 2                   4.057
## 3                   3.560
## 4                   3.185

This table summarizes the performance of different regression models based on their multiple R-squared, adjusted R-squared, F-statistic, residual degrees of freedom, and residual standard error.

AIC.lin = AIC(mlr.lin)
AIC.quad = AIC(mlr.quad)
AIC.cubic = AIC(mlr.cubic)
print(AIC.lin)
## [1] 1018.483
print(AIC.quad)
## [1] 1019.993
print(AIC.cubic)
## [1] 973.8577
print("---------------")
## [1] "---------------"
BIC.lin = BIC(mlr.lin)
BIC.quad = BIC(mlr.quad)
BIC.cubic = BIC(mlr.cubic)
print(BIC.lin)
## [1] 1028.062
print(BIC.quad)
## [1] 1032.765
print(BIC.cubic)
## [1] 989.8224

The code calculates the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for three different regression models: linear, quadratic, and cubic.

Overall, both AIC and BIC suggest that the cubic model provides the best fit among the three models.

#linear
new <- data.frame(time_train_transportation = 205:288)
# use predict function
pfore.lin=predict(mlr.lin,new,se.fit = TRUE)
efore.lin=test_transportation-pfore.lin$fit 
## Warning in test_transportation - pfore.lin$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)

me.lin=mean(efore.lin) # Mean Error
mpe.lin=100*(mean(efore.lin/test_transportation)) # Mean Percent Error
## Warning in efore.lin/test_transportation: longer object length is not a
## multiple of shorter object length
mse.lin=sum(efore.lin**2)/nfore # Mean Squared Error
mae.lin=mean(abs(efore.lin)) # Mean Absolute Error
mape.lin=100*(mean(abs((efore.lin)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.lin)/test_transportation: longer object length is not a
## multiple of shorter object length
me.lin
## [1] -1.540922
mpe.lin
## [1] -1.25843
mse.lin
## [1] 31.23893
mae.lin
## [1] 2.455101
mape.lin
## [1] 2.000836

The linear regression model exhibits a slight negative bias, with an average error of approximately -1.54. The average percentage error relative to the actual values is approximately -1.26%. The model’s predictions have an average squared error of approximately 31.24 and an average absolute error of approximately 2.46. The average absolute percentage error relative to the actual values is approximately 2.00%.
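
The accuracy measures above can be collected in a small helper function; `forecast_metrics` is a hypothetical name introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper wrapping the accuracy measures computed above.
forecast_metrics <- function(actual, pred) {
  e <- actual - pred                       # forecast errors
  c(ME   = mean(e),                        # Mean Error
    MPE  = 100 * mean(e / actual),         # Mean Percent Error
    MSE  = mean(e^2),                      # Mean Squared Error
    MAE  = mean(abs(e)),                   # Mean Absolute Error
    MAPE = 100 * mean(abs(e / actual)))    # Mean Absolute Percent Error
}

# Toy example with made-up actual and predicted values
forecast_metrics(c(100, 110, 120), c(98, 112, 119))
```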

#quadratic
matq=matrix(c(time_train_transportation,time_sqft_train_transportation),nrow=84,ncol=2,dimnames = list(c(),c("time_train_transportation","time_sqft_train_transportation")))
## Warning in matrix(c(time_train_transportation, time_sqft_train_transportation),
## : data length [360] is not a sub-multiple or multiple of the number of rows
## [84]
matq <- data.frame(matq)
pfore.quad=predict(mlr.quad,matq,se.fit = TRUE)
efore.quad=test_transportation-pfore.quad$fit
## Warning in test_transportation - pfore.quad$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)

me.quad=mean(efore.quad) # Mean Error
mpe.quad=100*(mean(efore.quad/test_transportation)) # Mean Percent Error
## Warning in efore.quad/test_transportation: longer object length is not a
## multiple of shorter object length
mse.quad=sum(efore.quad**2)/nfore # Mean Squared Error
mae.quad=mean(abs(efore.quad)) # Mean Absolute Error
mape.quad=100*(mean(abs((efore.quad)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.quad)/test_transportation: longer object length is not a
## multiple of shorter object length
me.quad
## [1] 19.08148
mpe.quad
## [1] 15.53051
mse.quad
## [1] 1304.153
mae.quad
## [1] 19.08148
mape.quad
## [1] 15.53051

For the quadratic regression model, the average error is approximately 19.08 units. The average percentage error relative to the actual values is approximately 15.53%. The model’s predictions have an average squared error of approximately 1304.15 and an average absolute error of approximately 19.08 units. The average absolute percentage error relative to the actual values is approximately 15.53%.

#cubic

matq=matrix(c(time_train_transportation,time_sqft_train_transportation, time_cubic_train_transportation),nrow=84,ncol=3,dimnames = list(c(),c("time_train_transportation","time_sqft_train_transportation", "time_cubic_train_transportation")))
## Warning in matrix(c(time_train_transportation, time_sqft_train_transportation,
## : data length [540] is not a sub-multiple or multiple of the number of rows
## [84]
matq <- data.frame(matq)

pfore.cubic=predict(mlr.cubic,matq,se.fit = TRUE)
efore.cubic=test_transportation-pfore.cubic$fit
## Warning in test_transportation - pfore.cubic$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)

me.cubic=mean(efore.cubic) # Mean Error
mpe.cubic=100*(mean(efore.cubic/test_transportation)) # Mean Percent Error
## Warning in efore.cubic/test_transportation: longer object length is not a
## multiple of shorter object length
mse.cubic=sum(efore.cubic**2)/nfore # Mean Squared Error
mae.cubic=mean(abs(efore.cubic)) # Mean Absolute Error
mape.cubic=100*(mean(abs((efore.cubic)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.cubic)/test_transportation: longer object length is not a
## multiple of shorter object length
me.cubic
## [1] 10.72762
mpe.cubic
## [1] 8.733014
mse.cubic
## [1] 835.2739
mae.cubic
## [1] 12.64123
mape.cubic
## [1] 10.29451

For the cubic regression model, the average error is approximately 10.73 units. The average percentage error relative to the actual values is approximately 8.73%. The model's predictions have an average squared error of approximately 835.27 and an average absolute error of approximately 12.64 units. The average absolute percentage error relative to the actual values is approximately 10.29%.

library(dplyr)

# Create data frames for AIC and BIC values
aic_bic <- data.frame(
  Model = c("Linear", "Quadratic", "Cubic"),
  AIC = c(1018.483, 1019.993, 973.8577),
  BIC = c(1028.062, 1032.765, 989.8224)
)

# Create data frames for model metrics
linear_metrics <- data.frame(
  Model = "Linear",
  ME = -1.540922,
  MPE = -1.25843,
  MSE = 31.23893,
  MAE = 2.455101,
  MAPE = 2.000836
)

quadratic_metrics <- data.frame(
  Model = "Quadratic",
  ME = 19.08148,
  MPE = 15.53051,
  MSE = 1304.153,
  MAE = 19.08148,
  MAPE = 15.53051
)

cubic_metrics <- data.frame(
  Model = "Cubic",
  ME = 10.72762,
  MPE = 8.733014,
  MSE = 835.2739,
  MAE = 12.64123,
  MAPE = 10.29451
)

# Combine all data frames into a single table
combined_table <- bind_rows(aic_bic, linear_metrics, quadratic_metrics, cubic_metrics)

# Print the combined table
print(combined_table)
##       Model       AIC       BIC        ME       MPE        MSE       MAE
## 1    Linear 1018.4830 1028.0620        NA        NA         NA        NA
## 2 Quadratic 1019.9930 1032.7650        NA        NA         NA        NA
## 3     Cubic  973.8577  989.8224        NA        NA         NA        NA
## 4    Linear        NA        NA -1.540922 -1.258430   31.23893  2.455101
## 5 Quadratic        NA        NA 19.081480 15.530510 1304.15300 19.081480
## 6     Cubic        NA        NA 10.727620  8.733014  835.27390 12.641230
##        MAPE
## 1        NA
## 2        NA
## 3        NA
## 4  2.000836
## 5 15.530510
## 6 10.294510

Based on the table, the cubic model has the lowest AIC and BIC values, indicating the best in-sample fit among the three models (linear, quadratic, and cubic). On the out-of-sample forecast metrics, however, the linear model has by far the lowest errors (MSE of about 31.2 and MAE of about 2.46), so the cubic model's advantage lies in fit rather than in predictive performance on the test set.

Following is the code to fit monthly seasonal indicator variables.

per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))

time_cubic_train_transportation= time_train_transportation^3/factorial(3)
mlr.cubic.month=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation +month ) 
summary(mlr.cubic.month)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     time_sqft_train_transportation + time_cubic_train_transportation + 
##     month)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4747  -1.9656   0.5455   2.9171   7.7183 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      9.372e+01  1.408e+00  66.559  < 2e-16 ***
## time_train_transportation        4.713e-01  5.350e-02   8.809 1.67e-15 ***
## time_sqft_train_transportation  -9.907e-03  1.372e-03  -7.222 1.79e-11 ***
## time_cubic_train_transportation  1.075e-04  1.495e-05   7.194 2.10e-11 ***
## month2                           1.848e-01  1.338e+00   0.138    0.890    
## month3                          -9.648e-02  1.338e+00  -0.072    0.943    
## month4                          -3.972e-01  1.338e+00  -0.297    0.767    
## month5                          -2.108e-01  1.339e+00  -0.157    0.875    
## month6                          -2.973e-01  1.339e+00  -0.222    0.825    
## month7                          -3.570e-01  1.339e+00  -0.267    0.790    
## month8                          -5.233e-01  1.340e+00  -0.391    0.697    
## month9                          -7.162e-01  1.340e+00  -0.534    0.594    
## month10                         -7.825e-01  1.341e+00  -0.584    0.560    
## month11                         -5.690e-01  1.342e+00  -0.424    0.672    
## month12                         -6.492e-01  1.342e+00  -0.484    0.629    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.664 on 165 degrees of freedom
## Multiple R-squared:  0.722,  Adjusted R-squared:  0.6984 
## F-statistic: 30.61 on 14 and 165 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.cubic.month, main="",which = 1:4)

From the above summary we can observe that the monthly indicators are not significant, suggesting no seasonal relationship in the time series.

per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))

mlr.quad.month=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation +month ) 
summary(mlr.quad.month)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     time_sqft_train_transportation + month)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1380  -2.2869   0.1722   3.4551   6.6203 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    98.8207233  1.3895754  71.116  < 2e-16 ***
## time_train_transportation       0.1177538  0.0241561   4.875 2.53e-06 ***
## time_sqft_train_transportation -0.0001741  0.0002585  -0.674    0.501    
## month2                          0.2438019  1.5289974   0.159    0.874    
## month3                          0.0211113  1.5290342   0.014    0.989    
## month4                         -0.2214052  1.5290948  -0.145    0.885    
## month5                          0.0229192  1.5291789   0.015    0.988    
## month6                         -0.0059157  1.5292862  -0.004    0.997    
## month7                         -0.0079097  1.5294165  -0.005    0.996    
## month8                         -0.1163963  1.5295699  -0.076    0.939    
## month9                         -0.2513754  1.5297464  -0.164    0.870    
## month10                        -0.2595137  1.5299464  -0.170    0.866    
## month11                         0.0125222  1.5301701   0.008    0.993    
## month12                        -0.0086012  1.5304182  -0.006    0.996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.187 on 166 degrees of freedom
## Multiple R-squared:  0.6348, Adjusted R-squared:  0.6062 
## F-statistic:  22.2 on 13 and 166 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.quad.month, main="",which = 1:4)

From the above summary we can observe that the monthly indicators are not significant, suggesting no seasonal relationship, and the residual plots still do not look good.

per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))

mlr.lin.month=lm( train_transportation ~ time_train_transportation +month ) 
summary(mlr.lin.month)
## 
## Call:
## lm(formula = train_transportation ~ time_train_transportation + 
##     month)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9479  -2.3026   0.0208   3.6013   6.7812 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               99.297173   1.194169  83.152   <2e-16 ***
## time_train_transportation  0.101994   0.006010  16.971   <2e-16 ***
## month2                     0.244673   1.526495   0.160    0.873    
## month3                     0.022679   1.526530   0.015    0.988    
## month4                    -0.219315   1.526589  -0.144    0.886    
## month5                     0.025357   1.526672   0.017    0.987    
## month6                    -0.003304   1.526778  -0.002    0.998    
## month7                    -0.005298   1.526909  -0.003    0.997    
## month8                    -0.113958   1.527062  -0.075    0.941    
## month9                    -0.249286   1.527240  -0.163    0.871    
## month10                   -0.257946   1.527441  -0.169    0.866    
## month11                    0.013393   1.527665   0.009    0.993    
## month12                   -0.008601   1.527914  -0.006    0.996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.18 on 167 degrees of freedom
## Multiple R-squared:  0.6338, Adjusted R-squared:  0.6075 
## F-statistic: 24.09 on 12 and 167 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)) 
plot(mlr.lin.month, main="",which = 1:4)

From the above summary we can observe that none of the monthly indicator coefficients is statistically significant, so the month dummies add no explanatory power beyond the linear trend; the residual diagnostic plots likewise show no seasonal pattern.
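
The twelve individual t-tests above can be summarized in a single joint test: a partial F-test comparing the trend-only model against the trend-plus-month model with anova(). The following is a minimal, self-contained sketch on simulated data; the simulated y and the month factor are hypothetical stand-ins for train_transportation and the factor built above.

```r
# Joint significance of the seasonal dummies via a partial F-test.
# The simulated series has a linear trend but no seasonal component,
# mirroring the situation suggested by the summaries above.
set.seed(1)
n <- 180
t_idx <- 1:n
y <- 100 + 0.1 * t_idx + rnorm(n, sd = 4)   # trend + noise, no seasonality
month <- factor(rep(1:12, length.out = n))

fit_trend  <- lm(y ~ t_idx)                 # reduced model: trend only
fit_season <- lm(y ~ t_idx + month)         # full model: trend + dummies
anova(fit_trend, fit_season)                # one p-value for all 11 dummies
```

A single large p-value here says the same thing as twelve insignificant t-statistics, but in one test.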

Following is the code to fit cosine and sine (trigonometric) harmonic terms.

t=1:length(train_transportation)
per=12
# Model 1
c1=cos(2*pi*t/per)
s1=sin(2*pi*t/per)
accd_trig1=lm(time_train_transportation~c1+s1)
summary(accd_trig1)
## 
## Call:
## lm(formula = time_train_transportation ~ c1 + s1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88.50 -46.77   0.00  46.77  88.50 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   90.500      3.900  23.204   <2e-16 ***
## c1             1.000      5.516   0.181    0.856    
## s1            -3.732      5.516  -0.677    0.500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.33 on 177 degrees of freedom
## Multiple R-squared:  0.002765,   Adjusted R-squared:  -0.008504 
## F-statistic: 0.2453 on 2 and 177 DF,  p-value: 0.7827
c2=cos(2*pi*2*t/per)
s2=sin(2*pi*2*t/per)
accd_trig2=lm(time_train_transportation~c1+s1+c2+s2)
summary(accd_trig2)
## 
## Call:
## lm(formula = time_train_transportation ~ c1 + s1 + c2 + s2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -87.5  -47.3    0.0   47.3   87.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   90.500      3.921  23.081   <2e-16 ***
## c1             1.000      5.545   0.180    0.857    
## s1            -3.732      5.545  -0.673    0.502    
## c2             1.000      5.545   0.180    0.857    
## s2            -1.732      5.545  -0.312    0.755    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.61 on 175 degrees of freedom
## Multiple R-squared:  0.003505,   Adjusted R-squared:  -0.01927 
## F-statistic: 0.1539 on 4 and 175 DF,  p-value: 0.961

From both summaries above we can observe that neither harmonic pair is significant and the harmonics explain almost none of the variation (R-squared below 0.01), so c1, s1, c2, and s2 have no relationship with the response. (Note that these models use time_train_transportation, the time index, as the response; to test for seasonality in the TSI itself, train_transportation would be the natural response.)
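
For contrast, here is a hypothetical sketch of the same harmonic regression applied to a simulated series that does contain a genuine annual cycle; there the first harmonic pair comes out strongly significant. The simulated y is an illustrative stand-in, not the TSI data.

```r
# Harmonic regression on a series with a real period-12 cycle.
set.seed(2)
n <- 180; per <- 12
t <- 1:n
y <- 100 + 0.1 * t + 3 * cos(2 * pi * t / per) + rnorm(n)  # true seasonality

c1 <- cos(2 * pi * t / per)
s1 <- sin(2 * pi * t / per)
fit <- lm(y ~ t + c1 + s1)
summary(fit)$coefficients["c1", ]   # estimate near 3 with a tiny p-value
```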

Following is the code to run the auto.arima function, which searches over autoregressive integrated moving average (ARIMA) orders to find the best-fitting model for the data.

library(forecast)
library(astsa)
## 
## Attaching package: 'astsa'
## The following object is masked from 'package:forecast':
## 
##     gas
auto.arima(train_transportation,ic="bic")
## Series: train_transportation 
## ARIMA(0,1,0) 
## 
## sigma^2 = 1.431:  log likelihood = -286.05
## AIC=574.1   AICc=574.12   BIC=577.29

Under the BIC, auto.arima selects an ARIMA(0,1,0) model (a random walk) for the “train_transportation” series. The estimated residual variance (sigma^2) is 1.431 and the log likelihood is -286.05.
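
The reported criteria can be reproduced by hand. Assuming the usual forecast-package accounting (k = 1 free parameter, sigma^2, and an effective sample size of n = 179 after one difference), the arithmetic checks out:

```r
# Recompute AIC/AICc/BIC for ARIMA(0,1,0) from the reported log likelihood.
loglik <- -286.05
k <- 1          # only sigma^2 is estimated
n <- 179        # 180 observations minus one lost to differencing
aic  <- -2 * loglik + 2 * k                   # 574.10
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)   # 574.12
bic  <- -2 * loglik + k * log(n)              # 577.29
round(c(AIC = aic, AICc = aicc, BIC = bic), 2)
```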

auto.arima(train_transportation,ic="aic")
## Series: train_transportation 
## ARIMA(3,1,2) with drift 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2  drift
##       -0.4840  -0.8287  0.0163  0.3438  0.8708  0.111
## s.e.   0.1138   0.1054  0.0874  0.0862  0.0929  0.082
## 
## sigma^2 = 1.34:  log likelihood = -277.47
## AIC=568.94   AICc=569.6   BIC=591.25

The auto.arima function identified an ARIMA(3,1,2) model with drift as the best fit for the “train_transportation” series under the Akaike Information Criterion (AIC). The model includes three autoregressive terms (lags 1 through 3), two moving average terms (lags 1 and 2), and a drift term; each coefficient is reported with its standard error. The estimated residual variance (sigma^2) is 1.34 and the log likelihood is -277.47. Lower AIC, AICc, and BIC values indicate better model fit.

d <- 1  # first difference (series treated as integrated of order 1)
np <- 3 # maximum AR order
nq <- 3 # maximum MA order
outarima.all <- matrix(nrow=(np+1)*(nq+1), ncol=3)
colnames(outarima.all) <- c("p", "q", "BIC")

# Check for missing values
if (any(is.na(train_transportation))) {
  stop("Input data contains missing values.")
}

for (p in 0:np) {
  for (q in 0:nq) {
    # Try fitting SARIMA model
    tryCatch({
      gnpgr.fit <- sarima(train_transportation, p, d, q, details = FALSE)
      outarima.all[p*(nq+1) + q + 1, ] <- c(p, q, gnpgr.fit$BIC)
    }, error = function(e) {
      cat("Error occurred for p =", p, "and q =", q, ":", conditionMessage(e), "\n")
    })
  }
}

outarima.all
##       p q      BIC
##  [1,] 0 0 3.244849
##  [2,] 0 1 3.250873
##  [3,] 0 2 3.272257
##  [4,] 0 3 3.274025
##  [5,] 1 0 3.248966
##  [6,] 1 1 3.277943
##  [7,] 1 2 3.278535
##  [8,] 1 3 3.302465
##  [9,] 2 0 3.277938
## [10,] 2 1 3.291620
## [11,] 2 2 3.274307
## [12,] 2 3 3.328085
## [13,] 3 0 3.270690
## [14,] 3 1 3.295343
## [15,] 3 2 3.303092
## [16,] 3 3 3.326566
outarima.all[which.min(outarima.all[,3]),]
##        p        q      BIC 
## 0.000000 0.000000 3.244849

The code snippet performs a grid search for the best-fitting ARIMA model for the “train_transportation” series under the Bayesian Information Criterion (BIC). It iterates over AR (p) and MA (q) orders from 0 to 3 with the differencing order fixed at d = 1, fits a model for each combination with sarima, and stores the resulting BIC in a matrix; any fitting error is caught and the loop continues. The smallest BIC again corresponds to p = 0 and q = 0, i.e., an ARIMA(0,1,0) random-walk model with one order of differencing and no autoregressive or moving average terms.
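
The same grid-search idea can be written against stats::arima, using BIC() so the criterion is on the familiar -2*logLik + k*log(n) scale (the sarima BIC values above are evidently reported on a per-observation scale, which is why they sit near 3.2 rather than near 577). A self-contained sketch on a simulated AR(1) series:

```r
# BIC grid search over (p, q) with stats::arima on simulated data.
set.seed(3)
y <- arima.sim(model = list(ar = 0.5), n = 180)

np <- 2; nq <- 2; d <- 0
grid <- expand.grid(p = 0:np, q = 0:nq)
grid$BIC <- apply(grid[, c("p", "q")], 1, function(pq) {
  fit <- try(arima(y, order = c(pq["p"], d, pq["q"])), silent = TRUE)
  if (inherits(fit, "try-error")) NA_real_ else BIC(fit)
})
grid[which.min(grid$BIC), ]   # order with the smallest BIC
```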

freight_auto_arima = arima(train_transportation,order=c(0,1,0), method='CSS',
include.mean=TRUE)
summary(freight_auto_arima)
## 
## Call:
## arima(x = train_transportation, order = c(0, 1, 0), include.mean = TRUE, method = "CSS")
## 
## 
## sigma^2 estimated as 1.431:  part log likelihood = -286.05
## 
## Training set error measures:
##                     ME     RMSE   MAE        MPE      MAPE      MASE       ACF1
## Training set 0.1138889 1.192826 0.885 0.09539694 0.8264137 0.9944444 -0.1568116
ar1.resid=na.omit(freight_auto_arima$resid)
ar1.resid
## Time Series:
## Start = 1 
## End = 180 
## Frequency = 1 
##   [1]  0.0 -0.4 -1.9 -1.2  1.5  0.2 -1.8  1.3  1.3 -0.6  0.3 -1.8  1.2 -0.1  0.2
##  [16] -1.0  1.1 -0.7 -0.7  1.3 -6.3  1.3  0.3  0.4  1.0  1.5 -0.9  0.7  0.9  0.4
##  [31]  0.7  0.0  0.4  0.9  0.7  0.1 -0.6 -0.5 -0.1 -0.9 -0.3  0.3  1.6 -0.3  1.2
##  [46]  1.2  0.2  2.1 -0.8  0.9  2.0  0.3 -0.1  0.8 -0.1 -0.6  0.4  0.9  0.5  0.2
##  [61]  1.8 -0.5 -0.2  0.5 -0.3 -0.5  0.1  0.3 -0.5 -0.7  2.2 -1.2  0.6 -0.6  0.3
##  [76] -0.3  1.6 -0.3 -1.0 -1.9  2.8 -1.6 -1.2  1.6 -0.3  0.3  1.3 -1.1  0.3 -0.4
##  [91] -0.7  0.7  0.1  1.2 -0.2 -0.2  2.0 -1.6 -0.9  1.4 -0.1 -1.3  1.0 -1.8 -2.1
## [106]  0.0 -2.6 -2.8 -1.8  2.3 -4.5 -0.1 -0.7  1.3  1.8  0.9  0.1 -0.9  1.4  1.0
## [121]  1.0  1.0  0.9  0.1  0.6  0.3  0.3  0.1  0.6  0.7 -0.4  0.5  1.2 -0.6  1.2
## [136] -0.5 -0.7  1.0  0.1 -0.4  1.0  0.4  0.3  2.0 -2.4  0.7 -0.3  0.0  0.5  0.5
## [151] -0.1 -0.3 -0.4 -1.8  1.6  0.4  1.9  1.2 -0.7 -0.3  0.5  0.7 -0.4  0.4  0.0
## [166]  0.1  2.0 -1.3 -1.0  1.6  1.8  0.3  0.4 -1.2  0.7  0.2  0.9  0.3  0.5  0.2
par(mfrow=c(2,2)) 
ts.plot(ar1.resid,main="Residuals from ARIMA(0,1,0) fit" )

hist(ar1.resid,main="histogram",xlab="resid")

qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)

shapiro.test(ar1.resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  ar1.resid
## W = 0.93338, p-value = 2.256e-07
acf(as.numeric(ar1.resid), lag.max=40, main="")

Box.test(ar1.resid, lag=4, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 10.329, df = 3, p-value = 0.01596
Box.test(ar1.resid, lag=8, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 16.103, df = 7, p-value = 0.02419
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
## 
##  Box-Ljung test
## 
## data:  ar1.resid
## X-squared = 10.569, df = 3, p-value = 0.0143

The Shapiro-Wilk p-value (2.256e-07) indicates a significant departure from normality in the residuals.

The Box-Pierce and Box-Ljung tests evaluate the autocorrelation of the residuals. All three tests yield p-values below the conventional 0.05 level, so there is significant autocorrelation in the residuals. The residuals are therefore not independent, violating one of the assumptions of the ARIMA(0,1,0) model.
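
The Box-Pierce statistic itself is simple to compute by hand: n times the sum of the first lag squared sample autocorrelations, compared against a chi-square distribution with lag - fitdf degrees of freedom. A sketch on simulated white noise, checked against Box.test:

```r
# Box-Pierce by hand versus Box.test.
set.seed(4)
e <- rnorm(200)                                   # white-noise stand-in
lag <- 4; fitdf <- 1
r <- acf(e, lag.max = lag, plot = FALSE)$acf[-1]  # r_1, ..., r_4 (drop lag 0)
Q <- length(e) * sum(r^2)                         # Box-Pierce statistic
p_val <- pchisq(Q, df = lag - fitdf, lower.tail = FALSE)

bt <- Box.test(e, lag = lag, fitdf = fitdf)       # built-in version
c(by_hand = Q, builtin = unname(bt$statistic))    # identical
```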

rec.yw.fore <- predict(freight_auto_arima, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4
## [13] 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4
## 
## $se
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 1.196153 1.691616 2.071798 2.392306 2.674680 2.929965 3.164724 3.383232
##  [9] 3.588459 3.782568 3.967191 4.143596 4.312791 4.475595 4.632681 4.784612
## [17] 4.931865 5.074848 5.213910 5.349359 5.481462 5.610455 5.736549 5.859929
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit

efore.freight_auto_arima=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)

me.freight_auto_arima=mean(efore.freight_auto_arima) # Mean Error
mpe.freight_auto_arima=100*(mean(efore.freight_auto_arima/test_transportation)) # Mean Percent Error
mse.freight_auto_arima=sum(efore.freight_auto_arima**2)/nfore # Mean Squared Error
mae.freight_auto_arima=mean(abs(efore.freight_auto_arima)) # Mean Absolute Error
mape.freight_auto_arima=100*(mean(abs((efore.freight_auto_arima)/test_transportation))) # Mean Absolute Percent Error
me.freight_auto_arima
## [1] 0.5041667
mpe.freight_auto_arima
## [1] 0.405107
mse.freight_auto_arima
## [1] 1.029583
mae.freight_auto_arima
## [1] 0.8458333
mape.freight_auto_arima
## [1] 0.6856508

Over the two-year holdout period, the flat random-walk forecast tracks the general level of the test data but misses the month-to-month movement.
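
The flat forecast is exactly what a random walk implies: every h-step-ahead forecast equals the last observed value, and the forecast standard error grows like sqrt(h). Using the reported sigma^2 = 1.431, this reproduces the $se column above up to rounding:

```r
# Random-walk forecast standard errors: se(h) = sigma * sqrt(h).
sigma <- sqrt(1.431)          # from the ARIMA(0,1,0) fit above
se <- sigma * sqrt(1:24)
round(se[1:4], 3)             # ~1.196 1.692 2.072 2.392, as printed above
```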

Following is the code to fit an ARIMA(0,1,1) model and interpret the results.

freight_prof = arima(train_transportation,order=c(0,1,1), method='CSS',
include.mean=TRUE)
summary(freight_prof)
## 
## Call:
## arima(x = train_transportation, order = c(0, 1, 1), include.mean = TRUE, method = "CSS")
## 
## Coefficients:
##           ma1
##       -0.1296
## s.e.   0.0669
## 
## sigma^2 estimated as 1.403:  part log likelihood = -284.3
## 
## Training set error measures:
##                     ME     RMSE       MAE       MPE      MAPE      MASE
## Training set 0.1306169 1.181218 0.8805218 0.1096842 0.8223829 0.9894125
##                     ACF1
## Training set -0.02390081
ar1.resid=na.omit(freight_prof$resid)
ar1.resid
## Time Series:
## Start = 1 
## End = 180 
## Frequency = 1 
##   [1]  0.000000000 -0.400000000 -1.951826854 -1.452892614  1.311752867
##   [6]  0.369960061 -1.752065335  1.072989914  1.439024229 -0.413549753
##  [11]  0.246417543 -1.768072385  0.970915927  0.025798795  0.203342676
##  [16] -0.973653472  0.973846509 -0.573821498 -0.774348408  1.199669895
##  [21] -6.144562209  0.503866679  0.365284562  0.447328874  1.057959121
##  [26]  1.637076732 -0.687888658  0.610872237  0.979148966  0.526865526
##  [31]  0.768264457  0.099541825  0.412897349  0.953497927  0.823541995
##  [36]  0.206703977 -0.573217958 -0.574270209 -0.174406546 -0.922597356
##  [41] -0.419538296  0.245641625  1.631827082 -0.088568840  1.188524389
##  [46]  1.353993700  0.375433085  2.148643789 -0.521606380  0.832416956
##  [51]  2.107853880  0.573108588 -0.025743962  0.796664429  0.003221528
##  [56] -0.599582596  0.322313801  0.941761276  0.622021310  0.280593519
##  [61]  1.836355698 -0.262068653 -0.233955485  0.469687058 -0.239143994
##  [66] -0.530985202  0.031201769  0.304042724 -0.460606055 -0.759679407
##  [71]  2.101570516 -0.927705529  0.479799852 -0.537833708  0.230314427
##  [76] -0.270158819  1.564996296 -0.097227914 -1.012597542 -2.031199362
##  [81]  2.536823318 -1.271311071 -1.364720133  1.423177122 -0.115603018
##  [86]  0.285021648  1.336929438 -0.926777883  0.179920045 -0.376688275
##  [91] -0.748806421  0.602979297  0.178126300  1.223079314 -0.041529117
##  [96] -0.205380809  1.973389397 -1.344313590 -1.074178860  1.260821723
## [101]  0.063361058 -1.291790489  0.832626407 -1.692118982 -2.319243009
## [106] -0.300497672 -2.638934622 -3.141919199 -2.207089469  2.014033741
## [111] -4.239047418 -0.649241229 -0.784120326  1.198403776  1.955273744
## [116]  1.153339217  0.249434858 -0.867681440  1.287577002  1.166827663
## [121]  1.151182517  1.149155421  1.048892776  0.235902032  0.630565150
## [126]  0.381700520  0.349455843  0.145277992  0.618823253  0.780179156
## [131] -0.298914422  0.461270515  1.259765499 -0.436775793  1.143408212
## [136] -0.351851874 -0.745588439  0.903396242  0.217050463 -0.371877393
## [141]  0.951816912  0.523324190  0.367805616  2.047655520 -2.134691141
## [146]  0.423414185 -0.245139437 -0.031762015  0.495884687  0.564250358
## [151] -0.026891698 -0.303484280 -0.439321589 -1.856921640  1.359403983
## [156]  0.576134079  1.974648042  1.455849490 -0.511369753 -0.366256714
## [161]  0.452545167  0.758634981 -0.301705839  0.360908839  0.046761924
## [166]  0.106058809  2.013741736 -1.039085253 -1.134631299  1.452989073
## [171]  1.988259631  0.557613104  0.472248332 -1.138812137  0.552447374
## [176]  0.271579024  0.935187716  0.421169593  0.554569738  0.271854012
par(mfrow=c(2,2)) 
ts.plot(ar1.resid,main="Residuals from ARIMA(0,1,1) fit" )

hist(ar1.resid,main="histogram",xlab="resid")

qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)

shapiro.test(ar1.resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  ar1.resid
## W = 0.93095, p-value = 1.462e-07

The Shapiro-Wilk test (p = 1.462e-07) again rejects normality of the residuals.

acf(as.numeric(ar1.resid), lag.max=40, main="")

Box.test(ar1.resid, lag=4, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 7.1294, df = 3, p-value = 0.06788
Box.test(ar1.resid, lag=8, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 12.407, df = 7, p-value = 0.08795
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
## 
##  Box-Ljung test
## 
## data:  ar1.resid
## X-squared = 7.3277, df = 3, p-value = 0.06216

The Box-Pierce and Box-Ljung tests assess the autocorrelation of the residuals from the ARIMA(0,1,1) model. The p-values are 0.06788, 0.08795, and 0.06216, all above the conventional 0.05 significance level, so we fail to reject the null hypothesis of no autocorrelation. The residuals are approximately independent, suggesting the ARIMA(0,1,1) model adequately captures the temporal dependence in the data.

rec.yw.fore <- predict(freight_prof, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
##  [9] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
## [17] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
## 
## $se
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 1.184513 1.570386 1.878604 2.142940 2.378073 2.591963 2.789501 2.973946
##  [9] 3.147602 3.312165 3.468930 3.618911 3.762919 3.901615 4.035547 4.165175
## [17] 4.290888 4.413022 4.531866 4.647671 4.760660 4.871029 4.978953 5.084585
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit

efore.freight_prof=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)

me.freight_prof=mean(efore.freight_prof) # Mean Error
mpe.freight_prof=100*(mean(efore.freight_prof/test_transportation)) # Mean Percent Error
mse.freight_prof=sum(efore.freight_prof**2)/nfore # Mean Squared Error
mae.freight_prof=mean(abs(efore.freight_prof)) # Mean Absolute Error
mape.freight_prof=100*(mean(abs((efore.freight_prof)/test_transportation))) # Mean Absolute Percent Error
me.freight_prof
## [1] 0.53939
mpe.freight_prof
## [1] 0.4337677
mse.freight_prof
## [1] 1.066341
mae.freight_prof
## [1] 0.8575744
mape.freight_prof
## [1] 0.6950459

Over the two-year holdout period, the ARIMA(0,1,1) forecasts behave much like the random walk: close to the overall level of the test data but not to the individual points.
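
For ARIMA(0,1,1), the point forecast is also flat after the first step, and the h-step forecast variance has a closed form, sigma^2 * (1 + (h - 1) * (1 + theta)^2). Plugging in the rounded estimates from the fit above (theta = -0.1296, sigma^2 = 1.403) reproduces the $se column to within rounding:

```r
# IMA(1,1) h-step forecast standard errors.
theta  <- -0.1296             # ma1 estimate from the fit above
sigma2 <- 1.403               # reported residual variance
h <- 1:24
se <- sqrt(sigma2 * (1 + (h - 1) * (1 + theta)^2))
round(se[1:4], 4)             # close to the printed 1.1845 1.5704 1.8786 2.1429
```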

freight_iterations = arima(train_transportation,order=c(3,1,2), method='CSS',
include.mean=TRUE)
freight_iterations
## 
## Call:
## arima(x = train_transportation, order = c(3, 1, 2), include.mean = TRUE, method = "CSS")
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2
##       -0.4059  -0.7420  0.0279  0.2854  0.8775
## s.e.   0.0825   0.0575  0.0699  0.0491  0.0515
## 
## sigma^2 estimated as 1.246:  part log likelihood = -273.65
ar1.resid=na.omit(freight_iterations$resid)
ar1.resid
## Time Series:
## Start = 1 
## End = 180 
## Frequency = 1 
##   [1]  0.000000000  0.000000000  0.000000000  0.000000000 -0.385774905
##   [6]  0.081578850 -0.257081929  0.677667036  0.518692937  0.199867702
##  [11]  0.472602255 -2.469955955  0.998857999  0.925426386 -0.040547356
##  [16] -1.826979413  1.402236695  0.201909693 -1.428106395  0.696136323
##  [21] -5.217688988  0.605002853  0.522574202  0.982149503  0.609859295
##  [26]  1.158488803 -0.426018994  0.524799438  0.698544678  0.649992415
##  [31]  0.712186872 -0.217766898  0.345457579  1.135340598  0.734994579
##  [36] -0.165208196 -0.662913147 -0.354731373 -0.068025269 -0.964171348
##  [41] -0.390733902  0.470762380  1.732802506 -0.327148122  0.829915066
##  [46]  1.470091786  0.438112730  1.623095895 -0.680279207  0.897792997
##  [51]  2.053855695  0.428041018 -0.443704891  0.677233456  0.338245302
##  [56] -0.734988847 -0.027131056  1.272642262  0.839494812 -0.296697119
##  [61]  1.575103602  0.175959321 -0.505276657 -0.012617945  0.215487841
##  [66] -0.295615862 -0.444241989  0.364136349 -0.004170472 -1.001485696
##  [71]  1.825942499 -0.454720156  0.292367435 -0.992664882  0.561873191
##  [76]  0.070535604  1.204397634 -0.287102335 -0.901101997 -2.064081135
##  [81]  2.674862870 -0.797478690 -1.838417455  1.071981919  0.810977477
##  [86]  0.226828317  0.378173698 -0.648286861  0.662893909 -0.751019763
##  [91] -0.976430994  1.048347675  0.433528442  0.735893564 -0.248647746
##  [96]  0.031659403  1.946082718 -1.514131593 -1.335439177  1.501395511
## [101]  0.588489178 -1.762071617  0.345506058 -0.908298910 -2.096348148
## [106] -0.820679652 -2.034268548 -2.496129312 -2.368417134  2.430480126
## [111] -3.439223949 -1.321044300 -0.748926068  2.440140494  1.771885179
## [116] -0.032056638  0.219018999 -0.276192265  0.970390693  0.863117624
## [121]  1.372024790  0.959945567  0.542144663  0.182381744  0.752727049
## [126]  0.217791723  0.141529320  0.196140560  0.674660062  0.644737936
## [131] -0.449440638  0.402808853  1.366052807 -0.474032860  0.769490427
## [136] -0.295217384 -0.586778641  0.737872531  0.304784081 -0.332321713
## [141]  0.711323856  0.594940159  0.421576299  1.748316070 -2.245575179
## [146]  0.308150378  0.030049003  0.185617858  0.178525655  0.497506725
## [151]  0.175334222 -0.470132183 -0.629615697 -1.589964956  1.587131069
## [156]  0.667258868  1.716692680  1.147993021 -0.648227233 -0.569105252
## [161]  0.556559908  1.040441732 -0.521777101 -0.020984260  0.309878489
## [166]  0.337944613  1.661072622 -1.184533465 -1.165999510  1.545831765
## [171]  2.325739591  0.225604940 -0.292441853 -0.979755447  1.037545253
## [176]  0.146201560  0.581913935  0.499845989  0.630736482 -0.018156486
par(mfrow=c(2,2)) 
ts.plot(ar1.resid,main="Residuals from ARIMA(3,1,2) fit" )

hist(ar1.resid,main="histogram",xlab="resid")

qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)

shapiro.test(ar1.resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  ar1.resid
## W = 0.95251, p-value = 9.697e-06

The Shapiro-Wilk test (p = 9.697e-06) indicates that the residuals are not normal.

acf(as.numeric(ar1.resid), lag.max=40, main="")

Box.test(ar1.resid, lag=4, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 3.8477, df = 3, p-value = 0.2784
Box.test(ar1.resid, lag=8, fitdf=1)
## 
##  Box-Pierce test
## 
## data:  ar1.resid
## X-squared = 5.0103, df = 7, p-value = 0.6587
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
## 
##  Box-Ljung test
## 
## data:  ar1.resid
## X-squared = 3.9671, df = 3, p-value = 0.265
All three p-values (0.2784, 0.6587, and 0.265) are well above 0.05, so we fail to reject independence: the ARIMA(3,1,2) residuals show no significant autocorrelation.

rec.yw.fore <- predict(freight_iterations, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 122.5045 122.3117 122.3180 122.4614 122.3931 122.3146 122.4011 122.4224
##  [9] 122.3474 122.3645 122.4138 122.3790 122.3570 122.3931 122.3938 122.3661
## [17] 122.3778 122.3936 122.3777 122.3728 122.3870 122.3845 122.3748 122.3810
## 
## $se
## Time Series:
## Start = 181 
## End = 204 
## Frequency = 1 
##  [1] 1.116062 1.486268 1.902308 2.267909 2.502995 2.734415 2.992456 3.199795
##  [9] 3.377991 3.574081 3.760161 3.918751 4.078887 4.242898 4.390985 4.531475
## [17] 4.675861 4.814405 4.943905 5.073395 5.201977 5.324056 5.443199 5.562339
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit

efore.freight_iterations=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)

me.freight_iterations = mean(efore.freight_iterations) # Mean Error
mpe.freight_iterations=100*(mean(efore.freight_iterations/test_transportation)) # Mean Percent Error
mse.freight_iterations=sum(efore.freight_iterations**2)/nfore # Mean Squared Error
mae.freight_iterations=mean(abs(efore.freight_iterations)) # Mean Absolute Error
mape.freight_iterations=100*(mean(abs((efore.freight_iterations)/test_transportation))) # Mean Absolute Percent Error
me.freight_iterations
## [1] 0.5212157
mpe.freight_iterations
## [1] 0.4190192
mse.freight_iterations
## [1] 1.03699
mae.freight_iterations
## [1] 0.8416413
mape.freight_iterations
## [1] 0.6820952

Over the two-year holdout period, the ARIMA(3,1,2) forecasts are again close to the overall level of the test data and slightly better than the previous two models on the absolute-error measures.

# Create a data frame with the computed values
results <- data.frame(
  Model = c("freight_3_1_2", "freight_0_1_1", "freight_0_1_0"),
  ME = c(me.freight_iterations, me.freight_prof, me.freight_auto_arima),
  MPE = c(mpe.freight_iterations, mpe.freight_prof, mpe.freight_auto_arima),
  MSE = c(mse.freight_iterations, mse.freight_prof, mse.freight_auto_arima),
  MAE = c(mae.freight_iterations, mae.freight_prof, mae.freight_auto_arima),
  MAPE = c(mape.freight_iterations, mape.freight_prof, mape.freight_auto_arima)
)

# Print the table
print(results)
##           Model        ME       MPE      MSE       MAE      MAPE
## 1 freight_3_1_2 0.5212157 0.4190192 1.036990 0.8416413 0.6820952
## 2 freight_0_1_1 0.5393900 0.4337677 1.066341 0.8575744 0.6950459
## 3 freight_0_1_0 0.5041667 0.4051070 1.029583 0.8458333 0.6856508

Based on the provided metrics, the model “freight_0_1_0” has the lowest Mean Error (ME), Mean Percent Error (MPE), and Mean Squared Error (MSE), while “freight_3_1_2” has the lowest Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE). The differences are small, but on the squared-error criterion the random-walk model “freight_0_1_0” performs best.

On in-sample fit, the ARIMA(3,1,2) model has the smaller residual variance (1.246 versus 1.431) and its AR and MA coefficients are large relative to their standard errors, suggesting it captures additional structure in the data; the ARIMA(0,1,0) model has no coefficients to compare. Model selection should weigh these diagnostics against out-of-sample accuracy and parsimony, and here the far simpler random walk forecasts essentially as well as the (3,1,2) model.
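
A compact way to see which model wins each criterion is a column-wise argmin over the results table (rebuilt here from the printed values so the snippet stands alone):

```r
# Which model minimizes each accuracy metric?
results <- data.frame(
  Model = c("freight_3_1_2", "freight_0_1_1", "freight_0_1_0"),
  ME    = c(0.5212157, 0.5393900, 0.5041667),
  MPE   = c(0.4190192, 0.4337677, 0.4051070),
  MSE   = c(1.036990, 1.066341, 1.029583),
  MAE   = c(0.8416413, 0.8575744, 0.8458333),
  MAPE  = c(0.6820952, 0.6950459, 0.6856508)
)
winners <- sapply(results[-1], function(col) results$Model[which.min(col)])
winners   # 0_1_0 wins ME/MPE/MSE; 3_1_2 wins MAE/MAPE
```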

Following is the code to check for an ARCH effect in the differenced (percentage log-return) series, using the data with the COVID-period observations excluded.

freight_diff = diff(log(train_transportation))*100; plot(freight_diff, type = "l"); abline(h=0)

par(mar=c(4.5,5,1.5,1), mfcol=c(2,3))
acf(freight_diff, main="return", xlab="Lag", lag=50)
pacf(freight_diff, main="", xlab="Lag", lag=50)
acf(abs(freight_diff), main="abs return", xlab="Lag", lag=50)
pacf(abs(freight_diff), main="", xlab="Lag", lag=50)
acf(freight_diff^2, main="sqr return", xlab="Lag", lag=50)
pacf(freight_diff^2, main="", xlab="Lag", lag=50)

library(TSA)
## Registered S3 methods overwritten by 'TSA':
##   method       from    
##   fitted.Arima forecast
##   plot.Arima   forecast
## 
## Attaching package: 'TSA'
## The following object is masked from 'package:readr':
## 
##     spec
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
McLeod.Li.test(y=freight_diff)

skewness(freight_diff)
## [1] -1.520745
kurtosis(freight_diff)
## [1] 6.456526
qqnorm(freight_diff); qqline(freight_diff)

shapiro.test(freight_diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  freight_diff
## W = 0.91464, p-value = 1.067e-08

From the squared-return ACF/PACF plots and the McLeod-Li test above, we can observe that there is no ARCH effect in the data. However, the Q-Q plot and the Shapiro-Wilk test (p = 1.067e-08) show that the series is not normally distributed, consistent with the strong negative skewness (-1.52) and heavy tails (kurtosis 6.46).
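
The McLeod-Li statistic is the Ljung-Box statistic applied to the squared series, so the test plotted above can be reproduced with Box.test alone. A sketch on simulated homoskedastic returns, which serve as a hypothetical stand-in for freight_diff:

```r
# McLeod-Li check by hand: Ljung-Box test on the squared series.
set.seed(5)
r <- rnorm(250)                           # homoskedastic "returns"
Box.test(r^2, lag = 12, type = "Ljung")   # small p-values would signal ARCH
```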